我们讨论VMware如何解决以下挑战来利用数据,以便操作基于ML的异常检测系统,以检测我们的软件定义数据中心(SDDC)企业部署中的性能问题:(i)由于较重依赖,标签稀缺和标签偏差不可提供的人类注释器,和(ii)数据漂移,由于不断变化的工作量模式,软件堆栈和基础硬件。我们的异常检测系统已在生产中部署多年,并已成功检测到许多主要的性能问题。我们证明通过解决这些数据挑战,我们不仅提高了我们的性能异常检测模型的准确性30%,而且还可以确保模型性能永远不会降低时间。
translated by 谷歌翻译
Existing object detection methods are bounded in a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary object detector on image-text pairs in a much simple and effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g. achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
translated by 谷歌翻译
对象看起来和声音的方式提供了对其物理特性的互补反射。在许多设置中,视觉和试听的线索都异步到达,但必须集成,就像我们听到一个物体掉落在地板上,然后必须找到它时。在本文中,我们介绍了一个设置,用于研究3D虚拟环境中的多模式对象定位。一个物体在房间的某个地方掉落。配备了摄像头和麦克风的具体机器人剂必须通过将音频和视觉信号与知识的基础物理学结合来确定已删除的对象以及位置。为了研究此问题,我们生成了一个大规模数据集 - 倒下的对象数据集 - 其中包括64个房间中30个物理对象类别的8000个实例。该数据集使用Threedworld平台,该平台可以模拟基于物理的影响声音和在影片设置中对象之间的复杂物理交互。作为解决这一挑战的第一步,我们基于模仿学习,强化学习和模块化计划,开发了一组具体的代理基线,并对这项新任务的挑战进行了深入的分析。
translated by 谷歌翻译
作为一种主动网络安全保护方案,入侵检测系统(IDS)承担以恶意网络流量形式检测网络攻击的重要责任。入侵检测技术是ID的重要组成部分。目前,许多学者已经对入侵检测技术进行了广泛的研究。但是,为大规模网络流量数据开发有效的入侵检测方法仍然很困难。由于生成的对抗网络(GAN)具有强大的建模功能,可用于复杂的高维数据,因此它们为解决此问题提供了新的想法。在本文中,我们提出了一种基于Ebgan的入侵检测方法IDS-Ebgan,该方法将网络记录归类为正常流量或恶意流量。 IDS-Ebgan中的发电机负责将培训中的原始恶意网络流量转换为对抗性恶意示例。这是因为我们想使用对抗性学习来提高歧视者检测恶意流量的能力。同时,鉴别器采用自动编码器模型。在测试过程中,IDS-Ebgan使用歧视器的重建错误来对流量记录进行分类。
translated by 谷歌翻译
人类视觉感知的关键方面是能够将视觉场景分解为单个对象并进一步进入对象部分,形成部分整个层次结构。这种复合结构可以诱导丰富的语义概念和关系,从而在视觉信号的解释和组织中发挥着重要作用,以及视觉感知和推理的概括。但是,现有的视觉推理基准主要专注于物体而不是零件。基于完整的部分整个层次结构的视觉推理比以前粒度概念,更丰富的几何关系和更复杂的物理学所致的对象的推理更具挑战性。因此,为了更好地为基于部分的概念,关系和物理推理服务,我们介绍了一个名为PTR的新型大规模诊断视觉推理数据集。 PTR包含大约70k RGBD合成图像,具有地面真理对象和有关语义实例分段,颜色属性,空间和几何关系的部分级别注释,以及诸如稳定性的某些物理性质。这些图像与700K机生成的问题配对,涵盖各种类型的推理类型,使其成为视觉推理模型的良好测试平台。我们在这个数据集上检查了几种最先进的视觉推理模型,并观察到他们在人类可以容易地推断正确答案的情况下仍然存在许多令人惊讶的错误。我们认为,此数据集将开辟基于零件推理的新机会。
translated by 谷歌翻译
不完美的标签在现实世界数据集中无处不在,严重损害了模型性能。几个最近处理嘈杂标签的有效方法有两个关键步骤:1)将样品分开通过培训丢失,2)使用半监控方法在错误标记的集合中生成样本的伪标签。然而,由于硬样品和噪声之间的类似损失分布,目前的方法总是损害信息性的硬样品。在本文中,我们提出了PGDF(先前引导的去噪框架),通过生成样本的先验知识来学习深层模型来抑制噪声的新框架,这被集成到分割样本步骤和半监督步骤中。我们的框架可以将更多信息性硬清洁样本保存到干净标记的集合中。此外,我们的框架还通过抑制当前伪标签生成方案中的噪声来促进半监控步骤期间伪标签的质量。为了进一步增强硬样品,我们在训练期间在干净的标记集合中重新重量样品。我们使用基于CiFar-10和CiFar-100的合成数据集以及现实世界数据集WebVision和服装1M进行了评估了我们的方法。结果表明了最先进的方法的大量改进。
translated by 谷歌翻译
视觉和语言导航(VLN)是一种任务,即遵循语言指令以导航到目标位置的语言指令,这依赖于在移动期间与环境的持续交互。最近的基于变压器的VLN方法取得了很大的进步,从视觉观测和语言指令之间的直接连接通过多模式跨关注机制。然而,这些方法通常代表通过使用LSTM解码器或使用手动设计隐藏状态来构建反复变压器的时间上下文作为固定长度矢量。考虑到单个固定长度向量通常不足以捕获长期时间上下文,在本文中,我们通过显式建模时间上下文来引入具有可变长度存储器(MTVM)的多模式变压器,通过模拟时间上下文。具体地,MTVM使代理能够通过直接存储在存储体中的先前激活来跟踪导航轨迹。为了进一步提高性能,我们提出了内存感知的一致性损失,以帮助学习随机屏蔽指令的时间上下文的更好关节表示。我们在流行的R2R和CVDN数据集上评估MTVM,我们的模型在R2R看不见的验证和测试中提高了2%的成功率,并在CVDN测试集上减少了1.6米的目标进程。
translated by 谷歌翻译
Optimization in multi-task learning (MTL) is more challenging than single-task learning (STL), as the gradient from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad'). This structure allows us to formalize cooperation and specialization as the process of matching experts and tasks. We optimize this matching process during the training of a single model. Specifically, we incorporate mixture of experts (MoE) layers into a transformer model, with a new loss that incorporates the mutual dependence between tasks and experts. As a result, only a small set of experts are activated for each task. This prevents the sharing of the entire backbone model between all tasks, which strengthens the model, especially when the training set size and the number of tasks scale up. More interestingly, for each task, we can extract the small set of experts as a standalone model that maintains the same performance as the large model. Extensive experiments on the Taskonomy dataset with 13 vision tasks and the PASCAL-Context dataset with 5 vision tasks show the superiority of our approach.
translated by 谷歌翻译
We present a new method for generating controllable, dynamically responsive, and photorealistic human animations. Given an image of a person, our system allows the user to generate Physically plausible Upper Body Animation (PUBA) using interaction in the image space, such as dragging their hand to various locations. We formulate a reinforcement learning problem to train a dynamic model that predicts the person's next 2D state (i.e., keypoints on the image) conditioned on a 3D action (i.e., joint torque), and a policy that outputs optimal actions to control the person to achieve desired goals. The dynamic model leverages the expressiveness of 3D simulation and the visual realism of 2D videos. PUBA generates 2D keypoint sequences that achieve task goals while being responsive to forceful perturbation. The sequences of keypoints are then translated by a pose-to-image generator to produce the final photorealistic video.
translated by 谷歌翻译
This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO adopts the state-of-the-art UASR learning method introduced by the Wav2vec-U, originally implemented at FAIRSEQ, which leverages self-supervised speech representations and adversarial training. In addition to wav2vec2, EURO extends the functionality and promotes reproducibility for UASR tasks by integrating S3PRL and k2, resulting in flexible frontends from 27 self-supervised models and various graph-based decoding strategies. EURO is implemented in ESPnet and follows its unified pipeline to provide UASR recipes with a complete setup. This improves the pipeline's efficiency and allows EURO to be easily applied to existing datasets in ESPnet. Extensive experiments on three mainstream self-supervised models demonstrate the toolkit's effectiveness and achieve state-of-the-art UASR performance on TIMIT and LibriSpeech datasets. EURO will be publicly available at https://github.com/espnet/espnet, aiming to promote this exciting and emerging research area based on UASR through open-source activity.
translated by 谷歌翻译